Automated document filing and processing methods and systems

ABSTRACT

Systems, methods and computer program products for automatically ingesting and filing documents in a database having a plurality of file locations. An electronic file having one or more documents is received. For each document in the received file, text data is identified and used to generate a plurality of suggested file locations for the received documents. Machine learning systems may be used to enhance the accuracy of suggested file locations.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 17/108,377, filed Dec. 1, 2020, which is a continuation of U.S. patent application Ser. No. 15/693,584, filed Sep. 1, 2017 and issued as U.S. Pat. No. 10,884,979 on Jan. 5, 2021, which claims the benefit of U.S. Provisional Patent Application No. 62/383,284, filed Sep. 2, 2016, the entire disclosures of which are each incorporated herein by reference.

FIELD

The described embodiments relate to electronic document management, and in particular systems, methods and computer program products for filing documents in a database.

BACKGROUND

People often use filing cabinets and file folders to store important documents. To make these filing systems useful, the folders and documents must be carefully organized and managed to make document retrieval convenient and easy. With electronic documents, databases with folder structures can be used to store documents. As with physical documents, organizing and managing the database is important in ensuring that documents are easily retrievable.

Managing electronic document databases can be a tedious and time-consuming task. Because electronic documents are easy to create and disseminate, large numbers of documents may be filed in electronic databases. The increased number of documents often results in an increased number of file folders and potential file locations, making it more difficult to identify the best filing location to store a particular document. As a result, individuals may neglect to file their documents, resulting in large numbers of unfiled documents that a user has to navigate to find a desired document.

Another difficulty when managing electronic databases is that the documents may be created with generic and/or non-descriptive names. For instance, document management systems, scanners and cameras may generate electronic documents with file names that appear random to a user or merely indicate the date on which the document was created. This makes it difficult for users to identify appropriate filing locations for these documents.

In some cases, electronic files can be generated that include a plurality of distinct documents. Because it is inconvenient to scan multiple documents separately, a series of documents may be scanned as a batch into a single electronic file. These multi-document files may have non-descriptive or generic names and may not provide any indication that multiple documents are included in the file. To properly file such documents, users may be required to review each file, identify and separate the distinct documents, create and name multiple individual document files, and store each document to the appropriate file location. Again, this can result in a significant number of unfiled documents or a poorly organized database.

SUMMARY

In a broad aspect, there is provided a method for automatic ingestion and filing of documents in a database having a plurality of file locations. The method can include receiving an electronic file including at least one document; for each document in the at least one document: identifying text data in the document; and generating a plurality of suggested file locations for each respective document.

In some embodiments, generating the plurality of suggested file locations may comprise processing the text data at a master node to generate a plurality of suggested file locations, wherein the master node is a machine learning node common to a plurality of users; and processing the text data at a client node to refine the plurality of suggested file locations for one of the plurality of users, wherein the client node is a machine learning node specific to the one of the plurality of users.

In some embodiments, the plurality of suggested file locations generated at the master node comprises first or second level file locations in a hierarchy.

In some embodiments, the plurality of suggested file locations generated at the master node comprises third or higher level file locations in a hierarchy.

In some embodiments, generating the plurality of suggested file locations comprises: comparing the one or more document keywords to a corpus of stored keywords, the corpus of stored keywords previously generated based on a plurality of documents in the database, wherein each of the stored keywords in the corpus has at least one file location association identifying a file location associated therewith; and generating a plurality of keyword scores based on the comparison of the one or more document keywords and the corpus.

In some embodiments, the method may further include identifying a plurality of pages in the file; determining a plurality of page markers for each page; determining that the at least one document in the file includes a plurality of distinct documents; and assigning each page to one of the distinct documents by grouping the plurality of pages into the distinct documents by comparing the page markers for the plurality of pages.

In some embodiments, the page markers can include image-based page markers derived from a visual appearance of the page. In some embodiments, the page markers can include text-based page markers derived from the text data in the document.

In some embodiments, the method may further include for each stored keyword in the corpus of stored keywords, determining a location-specific weighting for each file location association; and generating the plurality of suggested file locations by weighting the plurality of keyword scores using the location-specific weightings.

In some embodiments, the database can be arranged into a file directory having a plurality of folder levels with each file location in the plurality of file locations associated with a particular folder level, and the location-specific weighting for each file location association can be determined using the folder level of the file location corresponding to that file location association.

In some embodiments, the method may further include for each document in the at least one document: determining a keyword coefficient for each of the document keywords in the text data, each keyword coefficient indicating a measure of importance of the corresponding document keyword to the document; and generating the plurality of keyword scores using the keyword coefficient.

In some embodiments, the measure of importance of the corresponding document keyword to the document can be determined by: identifying keyword text attributes for the document keyword, the keyword text attributes including at least one of a text size, a text location and a text format; and determining the keyword coefficient for the document keyword in the text data based on the keyword text attributes.

In some embodiments, identifying the text data may include performing optical character recognition on the document to identify the text data.

In some embodiments, the method may further include determining a recommended file name for one of the received documents by: determining a keyword coefficient for each of the document keywords in the text data, each keyword coefficient indicating a measure of importance of the corresponding document keyword to the document; and determining the recommended file name using the keyword coefficients of the document keywords.

In another broad aspect, there is provided a computer program product for automatic ingestion and filing of documents in a database. The computer program product can include a non-transitory computer readable storage medium and computer-executable instructions stored on the computer readable storage medium. The instructions can configure a processor to: receive an electronic file including at least one document; for each document in the at least one document: identify text data in the document; and generate a plurality of suggested file locations for each respective document.

In some embodiments, generating the plurality of suggested file locations comprises: processing the text data at a master node to generate a plurality of suggested file locations, wherein the master node is a machine learning node common to a plurality of users; and processing the text data at a client node to refine the plurality of suggested file locations for one of the plurality of users, wherein the client node is a machine learning node specific to the one of the plurality of users.

In some embodiments, the plurality of suggested file locations generated at the master node comprises first or second level file locations in a hierarchy.

In some embodiments, the plurality of suggested file locations generated at the master node comprises third or higher level file locations in a hierarchy.

In some embodiments, generating the plurality of suggested file locations comprises: comparing the one or more document keywords to a corpus of stored keywords, the corpus of stored keywords previously generated based on a plurality of documents in the database, wherein each of the stored keywords in the corpus has at least one file location association identifying a file location associated therewith; and generating a plurality of keyword scores based on the comparison of the one or more document keywords and the corpus.

In some embodiments, the computer program product may also include instructions for configuring the processor to: identify a plurality of pages in the file; determine a plurality of page markers for each page; determine that the at least one document in the file includes a plurality of distinct documents; and assign each page to one of the distinct documents by grouping the plurality of pages into the distinct documents by comparing the page markers for the plurality of pages.

In some embodiments, the page markers can include image-based page markers derived from a visual appearance of the page. In some embodiments, the page markers can include text-based page markers derived from the text data in the document.

In some embodiments, the computer program product may also include instructions for configuring the processor to: for each stored keyword in the corpus of stored keywords, determine a location-specific weighting for each file location association; and generate the plurality of suggested file locations by weighting the plurality of keyword scores using the location-specific weightings.

In some embodiments, the database may be arranged into a file directory having a plurality of folder levels with each file location in the plurality of file locations associated with a particular folder level, and the computer program product may further include instructions for configuring the processor to determine the location-specific weighting for each file location association using the folder level of the file location corresponding to that file location association.

In some embodiments, the computer program product may also include instructions for configuring the processor to, for each document in the at least one document: determine a keyword coefficient for each of the document keywords in the text data, each keyword coefficient indicating a measure of importance of the corresponding document keyword to the document; and generate the plurality of keyword scores using the keyword coefficient.

In some embodiments, the computer program product may also include instructions for configuring the processor to determine the measure of importance of the corresponding document keyword to the document by: identifying keyword text attributes for the document keyword, the keyword text attributes including at least one of a text size, a text location and a text format; and determining the keyword coefficient for the document keyword in the text data based on the keyword text attributes.

In some embodiments, the computer program product may also include instructions for configuring the processor to identify the text data by performing optical character recognition on the document.

In some embodiments, the computer program product may also include instructions for configuring the processor to determine a recommended file name for one of the received documents by: determining a keyword coefficient for each of the document keywords in the text data, each keyword coefficient indicating a measure of importance of the corresponding document keyword to the document; and determining the recommended file name using the keyword coefficients of the document keywords.

BRIEF DESCRIPTION OF THE DRAWINGS

A preferred embodiment of the present invention will now be described in detail with reference to the drawings, in which:

FIG. 1A illustrates a system for filing electronic documents in a database in accordance with an example embodiment;

FIG. 1B illustrates a system for filing electronic documents in a database in accordance with another example embodiment;

FIG. 2A illustrates a method for generating suggested file locations for filing electronic documents in a database in accordance with the embodiment of FIG. 1A;

FIG. 2B illustrates a method for generating suggested file locations for filing electronic documents in a database in accordance with the embodiment of FIG. 1B;

FIG. 3 illustrates a method for ingesting electronic documents in accordance with some embodiments;

FIG. 4 illustrates a process for ingesting electronic documents in accordance with some embodiments;

FIG. 5 illustrates a screenshot of a suggested file location user interface in accordance with some embodiments;

FIG. 6 illustrates a screenshot of a keyword weighting user interface in accordance with some embodiments;

FIG. 7 illustrates a screenshot of a keyword definition user interface in accordance with some embodiments; and

FIG. 8 illustrates a flow diagram used for machine learning in accordance with the embodiment of FIG. 1B.

DESCRIPTION OF EXEMPLARY EMBODIMENTS

It will be appreciated that for simplicity and clarity of illustration, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements or steps. In addition, numerous specific details are set forth in order to provide a thorough understanding of the exemplary embodiments described herein. However, it will be understood by those of ordinary skill in the art that the embodiments described herein may be practiced without these specific details. In other instances, well-known methods, procedures and components have not been described in detail since these are known to those skilled in the art. Furthermore, it should be noted that this description is not intended to limit the scope of the embodiments described herein, but rather as merely describing one or more exemplary implementations.

It should also be noted that the terms “coupled” or “coupling” as used herein can have several different meanings depending on the context in which these terms are used. For example, the terms coupled or coupling may be used to indicate that an element or device can electrically, optically, or wirelessly send data to another element or device as well as receive data from another element or device.

It should be noted that terms of degree such as “substantially”, “about” and “approximately” as used herein mean a reasonable amount of deviation of the modified term such that the end result is not significantly changed. These terms of degree may also be construed as including a deviation of the modified term if this deviation would not negate the meaning of the term it modifies.

Furthermore, any recitation of numerical ranges by endpoints herein includes all numbers and fractions subsumed within that range (e.g. 1 to 5 includes 1, 1.5, 2, 2.75, 3, 3.90, 4, and 5). It is also to be understood that all numbers and fractions thereof are presumed to be modified by the term “about” which means a variation of up to a certain amount of the number to which reference is being made if the end result is not significantly changed.

The example embodiments of the systems and methods described herein may be implemented as a combination of hardware or software. In some cases, the example embodiments described herein may be implemented, at least in part, by using one or more computer programs, executing on one or more programmable devices comprising at least one processing element, and a data storage element (including volatile memory, non-volatile memory, storage elements, or any combination thereof). These devices may also have at least one input device (e.g. a pushbutton keyboard, mouse, a touchscreen, and the like), and at least one output device (e.g. a display screen, a printer, a wireless radio, and the like) depending on the nature of the device.

It should also be noted that there may be some elements that are used to implement at least part of one of the embodiments described herein that may be implemented via software that is written in a high-level computer programming language such as object oriented programming. Accordingly, the program code may be written in C, C++ or any other suitable programming language and may comprise modules or classes, as is known to those skilled in object oriented programming. Alternatively, or in addition thereto, some of these elements implemented via software may be written in assembly language, machine language or firmware as needed. In either case, the language may be a compiled or interpreted language.

At least some of these software programs may be stored on a storage medium (e.g. a computer readable medium such as, but not limited to, ROM, magnetic disk, optical disc) or a device that is readable by a general or special purpose programmable device. The software program code, when read by the programmable device, configures the programmable device to operate in a new, specific and predefined manner in order to perform at least one of the methods described herein.

Furthermore, at least some of the programs associated with the systems and methods of the embodiments described herein may be capable of being distributed in a computer program product comprising a computer readable medium that bears computer usable instructions for one or more processors. The medium may be provided in various forms, including non-transitory forms such as, but not limited to, one or more diskettes, compact disks, tapes, chips, and magnetic and electronic storage.

Embodiments of the systems, methods and computer program products described herein may facilitate filing and managing electronic documents in a database. In general, the embodiments described herein may provide for automatic ingestion, management and filing of one or more documents in a database having a plurality of file locations. In some embodiments, a cloud based document management or bookkeeping system is provided.

The embodiments described herein may involve receiving one or more documents. The documents may be received in various formats, such as email attachments, documents uploaded and/or moved between computing devices or between applications on a computing device, and/or documents generated using scanners or digital cameras, for example.

Manual approaches to document filing can be time consuming and may result in documents being filed to sub-optimal file locations (e.g. the first somewhat relevant folder a user sees). The embodiments described herein may provide a structured bookkeeping filing system that automates digital document storage to allow users to quickly and accurately store and organize their important documents digitally.

The embodiments described herein may provide improved techniques for organizing and storing such received documents by determining potential filing locations. The potential filing locations may be sorted or ranked and used to generate suggested filing locations. The suggested filing locations may be displayed to a user to allow the user to file a document. The suggested filing locations may also be used to automatically file a document.

Text data from received documents can be compared with stored keywords. Each stored keyword is associated with one or more filing locations. The stored keywords may include the folder names in a fixed directory structure, keywords assigned to the folders within the directory structure, and text data of previously saved documents. The comparison can be used to generate keyword scores indicating the relevance of the stored keywords to the text data in a document. The keyword scores for the stored keywords, and their associations with particular filing locations, can be used to determine the suggested filing locations.

Embodiments described herein may also generate recommended file names and file locations for the electronic documents. In some cases, the recommended file names may be initial file names for newly created documents, or recommended modifications to existing file names (e.g. where generic or non-descriptive file names are used).

In some cases, files may be received that include multiple documents within a single file. These multi-document files may be separated into separate files for each document by grouping the pages in the file into distinct documents. The grouping can be done based on page markers derived from the pages in the document. The page markers may include image-based page markers derived from the visual appearance of the page. The page markers may also include text-based page markers determined from the text data in the file. A distinct file name may be generated for each of the separate files. File locations may also be determined and suggested for each file.
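By way of illustration only, the grouping of pages into distinct documents by comparing page markers might be sketched as follows. The `PageMarkers` fields, the tolerance values and the rule that only adjacent pages are compared are assumptions made for this sketch; the description above does not prescribe a particular comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageMarkers:
    """Hypothetical page markers for one scanned page."""
    text_angle: float   # image-based: skew of the text, in degrees
    text_size: float    # image-based: dominant text height, in points
    header_text: str    # text-based: text found at the top of the page


def same_document(a, b, angle_tol=1.0, size_tol=0.5):
    """Assumed rule: adjacent pages belong together when all markers agree."""
    return (abs(a.text_angle - b.text_angle) <= angle_tol
            and abs(a.text_size - b.text_size) <= size_tol
            and a.header_text == b.header_text)


def group_pages(pages):
    """Return lists of page indices, one list per distinct document."""
    documents = []
    for index, markers in enumerate(pages):
        if documents and same_document(pages[index - 1], markers):
            documents[-1].append(index)   # continue the current document
        else:
            documents.append([index])     # markers changed: start a new one
    return documents


pages = [
    PageMarkers(0.0, 11.0, "Acme Bank Statement"),
    PageMarkers(0.2, 11.0, "Acme Bank Statement"),
    PageMarkers(1.8, 9.0, "Utility Invoice"),
]
groups = group_pages(pages)
print(groups)
```

In this sketch the first two pages share all markers and are grouped together, while the third page differs in angle, size and header text and so begins a second document.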

To identify suggested filing locations and/or recommended file names, text data can be identified in a received document. For instance, if the document is an electronically created document then the text data may be automatically identified because it is already in a format recognizable to the computing system. In other cases, e.g. where documents are scanned or generated by a digital camera, techniques such as optical character recognition may be used to identify the text data.

In some embodiments, once text data has been identified in a document, the text data can be indexed to identify one or more document keywords. Indexing the text data may include identifying a plurality of document keywords in the document text. The document keywords may be identified while excluding various commonly used words. For instance, articles may be excluded from being considered document keywords. The indexing may also include determining a word occurrence level. The word occurrence level may be an absolute number of times the word is present in the document. Alternatively, the word occurrence level may be a relative measure of how often the word is present in the document.

In some cases, words that are present in the document more than a keyword threshold number of times may be identified as document keywords. That is, the word occurrence level may need to meet the keyword threshold in order to be considered a document keyword. The keyword threshold may be determined based on the length of the document or other potential keywords in a document. In some cases, the keyword threshold may be an absolute keyword threshold, e.g., 5 or 10 times per page. In other cases, the keyword threshold may be a relative keyword threshold, e.g., the 5 or 10 most prevalent potential keywords.
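The indexing step described above can be sketched, under stated assumptions, as a word count with an exclusion list followed by either an absolute or a relative threshold. The stop-word list, the tokenizing pattern and the function names are hypothetical, and the absolute threshold is treated here as "meets or exceeds".

```python
import re
from collections import Counter

# Assumed exclusion list; a real system would likely use a larger one.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "for", "on"}


def word_occurrences(text):
    """Count each non-excluded word (absolute word occurrence level)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def keywords_absolute(text, threshold):
    """Words whose occurrence count meets an absolute keyword threshold."""
    return {w for w, n in word_occurrences(text).items() if n >= threshold}


def keywords_relative(text, top_n):
    """The top_n most prevalent words (relative keyword threshold)."""
    return [w for w, _ in word_occurrences(text).most_common(top_n)]


text = ("The invoice total is due. Invoice number 42 covers the invoice "
        "for March. Total due: 100.")
print(keywords_absolute(text, 2))
print(keywords_relative(text, 1))
```

With this sample text, "invoice" occurs three times and "total" and "due" twice each, so an absolute threshold of 2 selects those three words, while a relative threshold of 1 selects only "invoice".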

The document keywords can be compared to a corpus of stored keywords. The corpus of stored keywords can be generated using documents previously stored in the database. For example, the corpus of stored keywords may include keywords determined from the file name of previously stored documents and/or document keywords identified in previously stored documents. The corpus of stored keywords may also be determined from attributes of the database directory.

For example, the corpus of stored keywords may include keywords determined from folder names and/or file location names. In some cases, the corpus of stored keywords may include user-defined keywords. A user may enter keywords to be associated with specific files or folders. In some cases, keywords may be automatically pre-populated into the corpus of keywords and associated with file locations/folders (e.g., the keyword “IRS” may be associated with a folder for tax documents).

Each of the stored keywords in the corpus has at least one file location association that identifies a file location associated with that stored keyword. In some cases, a stored keyword may have two or more file location associations identifying different file locations. The file location associations may be generated automatically, e.g. based on the document keywords, or document file name of documents previously filed to a particular file location. In some cases, file location associations may also be generated manually when user-defined keywords are entered by users to be associated with particular file locations.

Based on the comparison of the one or more document keywords and the corpus, a plurality of keyword scores may be generated. The plurality of keyword scores may indicate relevance or a match between the document keywords in a particular document and one or more stored keywords in the corpus. The plurality of keyword scores may then be used to generate a plurality of suggested file locations using the file location associations of the corresponding keywords.
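As a minimal sketch of the comparison described above, the corpus can be modelled as a mapping from each stored keyword to its file location associations. The corpus contents, the exact-match scoring rule (score equals occurrence count) and the function names are assumptions for illustration, not the scoring method of any particular embodiment.

```python
from collections import defaultdict

# Hypothetical corpus: stored keyword -> file location associations.
corpus = {
    "irs": ["/Taxes"],
    "statement": ["/Banking/Bank Accounts", "/Utilities"],
    "deposit": ["/Banking/Bank Accounts"],
}


def keyword_scores(document_keywords, corpus):
    """Assumed scoring: a matching keyword scores its occurrence count."""
    return {kw: n for kw, n in document_keywords.items() if kw in corpus}


def suggest_locations(document_keywords, corpus, limit=3):
    """Sum keyword scores per associated location and rank the locations."""
    totals = defaultdict(float)
    for kw, score in keyword_scores(document_keywords, corpus).items():
        for location in corpus[kw]:
            totals[location] += score
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [location for location, _ in ranked[:limit]]


# Document keywords with their word occurrence levels.
doc = {"statement": 4, "deposit": 2, "march": 1}
suggestions = suggest_locations(doc, corpus)
print(suggestions)
```

Here "/Banking/Bank Accounts" accumulates a score of 6 from two matching keywords and is ranked ahead of "/Utilities", which accumulates 4 from one.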

In some embodiments, documents and/or their text data may be input to an artificial intelligence (AI) or machine learning system, which can be trained to identify associations between portions of the text data and file names or file locations, and to output suggested file names or file locations following analysis of each document.

The machine learning system can be pre-trained using preset file naming conventions and example folder hierarchies. Such preset file naming conventions may in some cases consider multiple naming conventions, to account for differences in how some users will choose to name their files.

When outputting predictions for file names and locations, it may be difficult to account for the preferences of a wide number of users. Attempting to do so may lead to large storage requirements, slower processing times, and reduced accuracy. For example, one user may prefer to have file names that include spaces, whereas another user may prefer to replace spaces in a file name with underscore characters. One user may prefer to store dates in a file name using the ISO 8601 format (e.g., yyyy-mm-dd), while another user may prefer dates to be formatted using a local convention, such as the month-day-year convention that is prevalent in the United States. Still other users may prefer to include certain keywords in a file name (e.g., institution name, account number, etc.).

The wide variety of user preferences may make it difficult for a single machine learning node to accurately predict file names and file locations for a wide variety of users. Accordingly, in some embodiments, multiple connected machine learning nodes may be used to make predictions. In particular, there may be a master node that is common to a plurality of users, and there may be a client node that is specific to a single user (or a subset of the plurality of users).

The master node may be trained using a generic data set, and may be used to output predictions of only high-level file locations and file names. For example, the master node may predict file locations at a first level and a second level (e.g., “/Banking/Bank Accounts”), and may predict certain elements that may be used in a file name. The client node may then predict the third and subsequent levels of the file locations, along with the predicted file name, which will be based on the client node's learned associations.
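The division of labour between the two nodes can be illustrated with the following sketch, in which simple lookup tables stand in for trained machine learning models. The class names, the table contents and the "Chequing 1234" sub-folder are hypothetical; only the two-stage flow (shared high-level prediction, then per-user refinement) is taken from the description above.

```python
class MasterNode:
    """Shared across users; predicts only first and second level locations.
    A lookup table stands in for a trained model (assumption)."""

    def __init__(self, rules):
        self.rules = rules  # keyword -> "/Level1/Level2"

    def predict(self, keywords):
        return sorted({self.rules[k] for k in keywords if k in self.rules})


class ClientNode:
    """Specific to one user; refines high-level locations to deeper levels."""

    def __init__(self, subfolders):
        # (high-level location, keyword) -> third-level folder name
        self.subfolders = subfolders

    def refine(self, locations, keywords):
        refined = []
        for location in locations:
            for k in keywords:
                sub = self.subfolders.get((location, k))
                if sub:
                    refined.append(f"{location}/{sub}")
        # Fall back to the master node's output when nothing more is known.
        return refined or locations


master = MasterNode({"statement": "/Banking/Bank Accounts",
                     "irs": "/Taxes/Federal"})
client = ClientNode({("/Banking/Bank Accounts", "chequing"): "Chequing 1234"})

keywords = ["statement", "chequing"]
high_level = master.predict(keywords)
final = client.refine(high_level, keywords)
print(high_level, final)
```

Keeping user-specific associations out of the master node in this way is one means of limiting what the shared node must store for a large number of users.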

The system may thus be considered a multi-node machine learning system, in which learning and processing is distributed across multiple nodes. In addition, learned knowledge may be pushed back from client nodes to the master node in a feedback process, to assist the master node in improving its first and second level predictions. This frees the client nodes to adapt and specialize their predictions to a user's particular tastes.

The machine learning system may be a supervised AI or an unsupervised AI.

In some cases, documents may be automatically filed to the suggested file locations, at least temporarily. In other cases, a user may be prompted to approve the suggested file location or identify another file location before the document is stored to a file location. If a user chooses to defer selecting a file location, the document may be temporarily filed to the suggested filing location, or alternatively the document may be stored in an unfiled document folder. A user may be periodically prompted to select or approve the file location for documents for which filing was deferred. In embodiments that use machine learning, user selections may be fed back to the client and master nodes to improve future prediction performance.

In embodiments that use a keyword corpus, the stored keywords in the corpus may have location-specific weightings for each of their corresponding file associations. The location-specific weightings may be used to generate the suggested file locations by weighting the keyword scores for particular file locations.

The location-specific weightings may indicate the relevance of the stored keyword to a specific location. That is, a file association may be given a higher location-specific weighting when the stored keyword is more relevant to the particular file location. For example, where multiple stored keywords are associated with a particular file location, the keywords may be scored and/or ranked to indicate the relevance of that keyword to the particular file location.

In other cases, the location-specific weighting may be determined based on the particular file location. For example, the database may include a number of folder levels (e.g. categories and sub-categories) with sub-folders nested within higher level folders. A location-specific weighting may be greater for file locations corresponding to sub-folders (i.e. sub-categories or more specific file locations) than file locations corresponding to the above folders. In some cases, a user may manually adjust the location-specific weighting for a stored keyword associated with a particular file location.
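A folder-level weighting of this kind might be sketched as follows. The linear formula and the `per_level` increment of 0.5 are assumptions chosen for the example; the description only requires that deeper, more specific locations can receive a greater weighting.

```python
def folder_level(location):
    """Depth of a location in the directory, e.g. "/Banking" is level 1."""
    return len([part for part in location.split("/") if part])


def location_weight(location, per_level=0.5):
    """Assumed linear weighting: each level below the first adds per_level."""
    return 1.0 + per_level * (folder_level(location) - 1)


def weighted_scores(keyword_score, locations):
    """Weight one keyword's score for each of its associated locations."""
    return {loc: keyword_score * location_weight(loc) for loc in locations}


scores = weighted_scores(4.0, ["/Banking", "/Banking/Bank Accounts"])
print(scores)
```

With a keyword score of 4.0, the first-level folder keeps a weighted score of 4.0 while the second-level sub-folder receives 6.0, so the more specific location is ranked higher.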

In some embodiments, a keyword coefficient may be determined for the document keywords identified in the text data. A keyword coefficient may indicate a measure of importance of the document keyword to the document. For example, the keyword coefficient may be determined using the word occurrence level of a keyword in the document. The plurality of keyword scores for a particular document may then be generated using the keyword coefficient. The keyword coefficient can also be used to identify important keywords indicative of a recommended file name.

In some embodiments, the importance of a document keyword within a document may be determined based on keyword text attributes of the document keyword. Keyword text attributes may include, for example, text location, text size, and text formatting. The keyword text location may be determined based on the location(s) of the document keyword within the document. For example, text located near the beginning or top of a page may be identified as of greater importance than text further below in the page. Similarly, text size may be used to determine the importance of a document keyword within a document. Larger text may indicate keywords that are more important to the document. Text formatting, such as bolding or underlining, may also indicate keywords that may be more important to a document.
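One way to combine these text attributes into a keyword coefficient is sketched below. The multiplicative formula, the 11-point body size and the 0.5 bonus for each format are illustrative assumptions only; the description does not specify how the attributes are combined.

```python
from dataclasses import dataclass


@dataclass
class KeywordOccurrence:
    """Hypothetical text attributes for one occurrence of a keyword."""
    text_size: float          # in points
    vertical_position: float  # 0.0 = top of page, 1.0 = bottom of page
    is_bold: bool = False
    is_underlined: bool = False


def keyword_coefficient(occ, body_size=11.0):
    """Assumed formula: larger, higher and emphasized text scores more."""
    size_factor = occ.text_size / body_size
    position_factor = 1.0 + (1.0 - occ.vertical_position)
    format_factor = 1.0 + 0.5 * occ.is_bold + 0.5 * occ.is_underlined
    return size_factor * position_factor * format_factor


title = KeywordOccurrence(text_size=22.0, vertical_position=0.0, is_bold=True)
body = KeywordOccurrence(text_size=11.0, vertical_position=1.0)
print(keyword_coefficient(title), keyword_coefficient(body))
```

Under these assumptions a bold 22-point keyword at the top of the page receives a coefficient of 6.0, compared with 1.0 for plain body text at the bottom of the page.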

In some cases, the document may be identified as a particular documenttype from a plurality of document types. The plurality of document typesmay be pre-populated in the system as template document types (e.g.common business forms, papers etc.). The plurality of document types canalso be updated continuously as new documents and new document templatesare stored in the system.

A plurality of document regions may be identified for a document type. For example, document regions may include title regions, header regions, footer regions, body regions, or other regions specific to document types. The document regions within the document may then be associated with a regional importance measure for the document type. For example, the title region of a document may be identified as a highly important region in various document types.

In other cases, other regions within the document may also be identified as being important. For example, a document type such as an income tax document may always have the same title but another region, such as a header region, may include text data that is more descriptive of the specific document. Accordingly, in such embodiments the header region may be identified as a highly important region in that document type.

The keyword coefficient for each of the document keywords in the text data can be determined based on the document region in which that document keyword appears within the document. Document keywords present in one or more highly relevant regions of the document may have a greater keyword coefficient than other potential keywords that occur often, but in less important regions of the document.
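As a minimal sketch of region-based keyword coefficients, assume each keyword occurrence has already been labelled with the document region in which it was found. The region names, the importance values in `REGION_IMPORTANCE`, and the choice to take the most important region per keyword are all illustrative assumptions.

```python
# Hypothetical regional importance measures for one document type.
REGION_IMPORTANCE = {"title": 3.0, "header": 2.0, "body": 1.0, "footer": 0.5}


def keyword_coefficients(occurrences, region_importance=REGION_IMPORTANCE):
    """Map each keyword to the importance of its most important region.

    occurrences is an iterable of (keyword, region) pairs. Regions that
    are not listed fall back to a neutral importance of 1.0.
    """
    coefficients = {}
    for keyword, region in occurrences:
        key = keyword.lower()
        importance = region_importance.get(region, 1.0)
        coefficients[key] = max(coefficients.get(key, 0.0), importance)
    return coefficients


occurrences = [
    ("Invoice", "title"),
    ("payment", "body"),
    ("payment", "body"),
    ("payment", "footer"),
]
coefficients = keyword_coefficients(occurrences)
```

Here "invoice", which appears once in the title region, receives a greater coefficient than "payment", which appears more often but only in the body and footer regions.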

In some cases, documents may be received as separate electronic files. In other cases, multiple documents may be received as a single file. In such cases, some embodiments described herein may automatically identify distinct documents in the file and separate the pages into those documents. For instance, when multiple documents are scanned as a batch into a single electronic file, it may be cumbersome for an individual to manually separate the file into the distinct documents. Accordingly, the embodiments described herein may analyze such a scanned file to identify the presence of multiple distinct documents and assign the pages to the corresponding documents.

A plurality of pages may be identified in the received file. Page markers may then be determined for each page. The page markers may include image-based page markers derived from the visual appearance of the page. For example, image-based page markers may include staple position, angle of text, layout of text, size of text, page number position, scan marks, scan lines, page outlines, and others. The image-based page markers may also include other image characteristics such as the average density or grey scales of the document.

The page markers may also include text-based page markers derived from the text data in the document. For example, the text-based page markers may identify corresponding header/footer data, page numbering, and grammar indicators (e.g. no punctuation in the text data at the end of a page, no capital letter in the text data at the beginning of a page). The page markers may also include metadata page markers determined from metadata extracted from the received file.

The embodiments described herein may determine that the file includes a plurality of distinct documents. For example, the plurality of distinct documents may be identified based on the page markers identified in the plurality of pages. Each of the pages in the plurality of pages may be assigned to one of the distinct documents by grouping the plurality of pages using the page markers. In some cases, the grouping of pages and identification of a plurality of distinct documents may occur substantially simultaneously, e.g. using clustering or classification techniques such as Bayesian classifiers. The embodiments described herein may then identify suggested file locations for each of the distinct documents identified.
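For illustration, the grouping of pages using text-based grammar indicators can be sketched with a simple rule in place of the clustering or classification techniques mentioned above. The function names and the rule itself (a page continues the previous one only when the previous page ends without terminal punctuation and the current page begins with a lower-case letter) are assumptions made for this example.

```python
TERMINAL_PUNCTUATION = ".!?:;"


def continues_previous(previous_page, current_page):
    """Return True when grammar indicators suggest one continuous document."""
    prev = previous_page.rstrip()
    cur = current_page.lstrip()
    if not prev or not cur:
        return False
    ends_open = prev[-1] not in TERMINAL_PUNCTUATION  # no punctuation at page end
    starts_lower = cur[0].islower()                   # no capital at page start
    return ends_open and starts_lower


def split_documents(pages):
    """Group page indices into distinct documents, in page order."""
    documents = []
    for index, text in enumerate(pages):
        if documents and continues_previous(pages[index - 1], text):
            documents[-1].append(index)
        else:
            documents.append([index])
    return documents


pages = [
    "Invoice 42. The total amount due is",
    "payable within 30 days.",
    "Dear Ms. Lee, thank you for your letter.",
]
groups = split_documents(pages)
```

A practical system would likely combine several page markers, including image-based and metadata page markers, rather than rely on one text-based rule.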

In some cases, the embodiments described herein may also generate file names for the electronic documents. For example, when multiple distinct documents are received in a single file, new file names may be required for each distinct file. Rather than requiring a user to manually input a file name, or generating a non-descriptive file name, the embodiments herein may generate file names for the received documents based on the text data in the documents. For example, the recommended file name may be determined based on keyword text attributes, such as the text size/location/formatting discussed above. Similarly, document regions may also be used to identify keywords that may be relevant to the document file name. Keyword coefficients can also be used to identify words that may be relevant to the document file name.

In some cases, the document keywords in a document may be compared to stored file names associated with the plurality of suggested file locations determined for that document. A recommended file name for a document may be determined based on the comparison.

The recommended file name may also be determined taking into account the relationship between previously stored file names and the text data within the corresponding documents. That is, a file naming convention may be determined based on text data from previously stored documents. For instance, if a previously stored document has a title document region and a date document region, and text data from those regions appears in the file name, a similar naming convention may be used to automatically generate the recommended file name.
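One way such a naming convention could be inferred and reused is sketched below. The sketch assumes that the text of each document region is available as a dictionary; the function names, the region names, and the " - " separator are hypothetical.

```python
def infer_convention(regions, file_name):
    """List the regions whose text appears in file_name, in order of appearance.

    regions maps a region name (e.g. "title") to the text found in that
    region of a previously stored document.
    """
    lowered = file_name.lower()
    found = []
    for region, text in regions.items():
        position = lowered.find(text.lower()) if text else -1
        if position >= 0:
            found.append((position, region))
    return [region for _, region in sorted(found)]


def recommend_name(convention, regions, separator=" - "):
    """Build a recommended file name from the same regions of a new document."""
    return separator.join(regions[r] for r in convention if regions.get(r))


stored_regions = {"title": "Lease Agreement", "date": "2016-09-02",
                  "footer": "Page 1 of 3"}
convention = infer_convention(stored_regions, "2016-09-02 - Lease Agreement.pdf")

new_regions = {"title": "Service Contract", "date": "2017-01-15",
               "footer": "Page 1 of 2"}
suggested = recommend_name(convention, new_regions)
```

In this example the stored file name contains the date region followed by the title region, so the new document is given a name built from its own date and title in the same order.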

In some cases, the file may automatically be given the recommended file name, at least temporarily. A user may be prompted to approve the recommended file name, or to enter an alternative file name.

In the embodiments described herein, determining suggested file locations may simplify the task of filing a large number of electronic documents in a digital database or digital filing cabinet. The embodiments described herein may enable a user to more easily and rapidly identify one or more file locations for saving their business or personal documents that may facilitate later retrieval. Generating recommended file names may further facilitate the management of files, by providing a user with a one-click option for creating or modifying a file name.

Embodiments where multi-document files can be automatically separated with the pages grouped into distinct document files may significantly reduce the time required for users to upload, name and file large numbers of documents. Rather than scanning many documents separately, multiple documents uploaded in a single scan can be automatically split and stored as separate documents.

Referring now to FIG. 1A, shown therein is an example embodiment of a system 100 that may be used for automatic filing of documents. System 100 generally comprises a plurality of computers connected via data communication network 134, which itself may be connected to the Internet. As shown in FIG. 1A, system 100 includes at least one user device 102 that is coupled to a document filing server 120 over network 134.

Typically, the connection between network 134 and the Internet may be made via a firewall server (not shown). In some cases, there may be multiple links or firewalls, or both, between network 134 and the Internet. Some organizations may operate multiple networks 134 or virtual networks 134, which can be internetworked or isolated. These have been omitted for ease of illustration; however, it will be understood that the teachings herein can be applied to such systems. Network 134 may be constructed from one or more computer network technologies, such as IEEE 802.3 (Ethernet), IEEE 802.11 and similar technologies.

Computers and computing devices such as user device 102 and server 120 may be connected to network 134 or a portion thereof via suitable network interfaces. In some cases, the user device 102 may connect to server 120 using network 134 via the Internet. In other cases, the user device 102 may be directly linked to server 120, for example, via a Universal Serial Bus, Bluetooth™ or Ethernet connection.

The user device 102 may be a computer such as a smartphone, desktop or laptop computer, which can connect to network 134 via a wired Ethernet connection or a wireless connection. The user device 102 has a processor 104, a memory 106 that may include volatile memory and non-volatile storage, at least one communication interface 112, input devices 110 such as a keyboard and trackpad, output devices such as a display 108 and speakers, and various other input/output devices as will be appreciated. The user device 102 may also include computing devices such as a smartphone or tablet computer. Only one user device 102 is shown in FIG. 1A for ease of illustration; however, there may be a plurality of user devices 102 in system 100.

Processor 104 is a computer processor, such as a general purpose microprocessor. In some other cases, processor 104 may be a field programmable gate array, application specific integrated circuit, microcontroller, or other suitable computer processor.

Processor 104 is coupled to display 108, which is a suitable display for outputting information and data as needed by various computer programs. In particular, display 108 may display graphical user interfaces (GUI), such as the example user interfaces shown in FIGS. 5-7 discussed below. The user device 102 may execute an operating system, such as Apple iOS™, Microsoft Windows™, GNU/Linux, or other suitable operating system.

Communication interface 112 is one or more data network interfaces, such as an IEEE 802.3 or IEEE 802.11 interface, for communication over a network.

Processor 104 is coupled, via a computer data bus, to memory 106. Memory 106 may include both volatile and non-volatile memory. Non-volatile memory stores computer programs consisting of computer-executable instructions, which may be loaded into volatile memory for execution by processor 104 as needed. It will be understood by those of skill in the art that references herein to user device 102 as carrying out a function or acting in a particular way imply that processor 104 is executing instructions (e.g., a software program/application) stored in memory 106 and possibly transmitting or receiving inputs and outputs via one or more interfaces. Memory 106 may also store data input to, or output from, processor 104 in the course of executing the computer-executable instructions.

As used herein, the term “software application” or “application” refers to computer-executable instructions, particularly computer-executable instructions stored in a non-transitory medium, such as a non-volatile memory, and executed by a computer processor. The computer processor, when executing the instructions, may receive inputs and transmit outputs to any of a variety of input or output devices to which it is coupled.

For instance, a document management application 114 may be stored on the user device 102. Although shown separately from memory 106, it will be understood that document management application 114 may be stored in memory 106. In general, the document management application 114 may provide a user of the user device 102 with user interfaces for interacting with and managing storage of documents in document database 130. While document management application 114 is shown as being provided on the user device 102, the document management application 114 may be provided as a cloud application accessible to the user device 102 over the Internet using network 134. The document management application 114 may communicate with a document analysis application 132 of server 120 to assist the server 120 in organizing and managing documents in the document database 130.

The server 120 may be a computer such as a desktop or server computer, which can connect to network 134 via a wired Ethernet connection or a wireless connection. The server 120 has a processor 124, a memory 126 that may include volatile memory and non-volatile storage, at least one communication interface 128, and a document database 130. The processor 124, memory 126, and communication interface 128 may be implemented in generally the same manner as with processor 104, memory 106, and communication interface 112 respectively.

Although shown as separate elements, it will be understood that database 130 may be stored in memory 126. Optionally, server 120 may include additional input or output devices, although this is not required. As with all devices shown in system 100, there may be multiple servers 120, although not all are shown. In some cases, server 120 may be distributed over a plurality of computing devices, for instance operating as a cloud server. As with user device 102, references to acts or functions by server 120 imply that processor 124 is executing computer-executable instructions (e.g., a software program) stored in memory 126.

As noted above, memory 126 may also store database 130. In some example embodiments, database 130 is a relational database. In other embodiments, database 130 may be a non-relational database, such as a key-value database, NoSQL database, a graph database, or the like. In some cases, database 130 may be formed from a mixture of relational and non-relational databases.

The user device 102 and document filing server 120 may have various additional components not shown in FIG. 1A. For example, additional input or output devices (e.g., keyboard, pointing device, etc.) may be included beyond those shown in FIG. 1A.

Data stored in the database 130 can be arranged into a file directory system with a plurality of file locations. The file directory system may include a plurality of folder levels, with high-level folders having one or more sub-folders that provide for more granular organization of files. Each file location in the plurality of file locations can be associated with a particular folder (and thus a particular folder level), and may also have secondary associations with each of the folders above that folder in a hierarchy. The folders and sub-folders may reflect categories and sub-categories used to organize documents. Although described as folder levels within a hierarchy, the files need not be stored in a hierarchical manner, and may instead merely have data attributes that may be indicative of a relative position in a logical hierarchy.
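As a sketch of a file location that carries its position in a logical hierarchy as data attributes, rather than through physical storage, consider the following. The class name, the slash-separated folder string, and the property names are assumptions for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileLocation:
    """A file location identified by a logical folder path.

    The path is only an attribute; documents need not be stored in a
    physically hierarchical manner.
    """
    folder: str  # e.g. "Finance/Taxes/2016"

    @property
    def level(self):
        """Folder level, counting top-level folders as level 1."""
        return self.folder.count("/") + 1

    @property
    def ancestors(self):
        """Folders above this one (secondary associations), nearest first."""
        parts = self.folder.split("/")
        return ["/".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


location = FileLocation("Finance/Taxes/2016")
```

The primary association is with `location.folder`, while `location.ancestors` lists the folders with which the location has secondary associations.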

The server 120 may store a software application referred to herein as a document analysis application 132. Although shown separately from memory 126, it will be understood that document analysis application 132 may be stored in memory 126. The document analysis application 132 may be configured to analyze documents received by document filing server 120 to determine suggested file locations in database 130. The document analysis application 132 may also be configured to identify and separate distinct documents within received files. The document analysis application 132 may also generate recommended file names for the document files.

While document analysis application 132 and document management application 114 are shown as separate applications, it will be understood that operations described as being performed by these applications may be performed by a single application operating on either the server 120 or user device 102, or such operations may be distributed between the user device 102 and server 120.

The document analysis application 132 may identify text data within received documents, for example using optical character recognition. The text data may be indexed and analyzed to identify document keywords. The document keywords can be compared against stored keywords, such as folder names within the file directory structure, keywords associated with file locations, and document keywords from text data of other previously saved documents, to generate keyword scores. The keyword scores can be used to sort potential file locations based on relevance rankings or best match, and then one or more of the potential file locations can be displayed to the user as suggested file locations for storing a document.
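The comparison and sorting described above can be sketched as follows, under the simplifying assumptions that stored keywords are held in a dictionary mapping each keyword to its associated file locations and that every exact match contributes a score of 1.0. The function name and data layout are hypothetical.

```python
from collections import defaultdict


def suggest_locations(document_keywords, stored_keywords, top_n=3):
    """Rank file locations by how many document keywords match them.

    stored_keywords maps a stored keyword to the file locations with
    which that keyword is associated.
    """
    wanted = {keyword.lower() for keyword in document_keywords}
    scores = defaultdict(float)
    for keyword, locations in stored_keywords.items():
        if keyword.lower() in wanted:
            for location in locations:
                scores[location] += 1.0  # simplistic per-match score
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [location for location, _ in ranked[:top_n]]


stored = {
    "invoice": ["Finance/Invoices"],
    "tax": ["Finance/Taxes"],
    "2016": ["Finance/Taxes", "Finance/Invoices"],
}
suggestions = suggest_locations(["Tax", "2016", "return"], stored)
```

The per-match score of 1.0 is a placeholder; similarity measures, keyword coefficients and location-specific weightings could be substituted for it.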

Computer vision and machine learning analysis can be applied to the text data to determine document keywords and recommended file names for the documents received by the system. Page markers, including image characteristics and text data markers, may be used to identify one or more distinct documents in a received file and to split the pages in the received file into the distinct documents.

Referring now to FIG. 1B, shown therein is an example embodiment of a system 100′ that may be used for automatic filing of documents. System 100′ generally comprises a plurality of computers connected via data communication network 134, which itself may be connected to the Internet. As shown in FIG. 1B, system 100′ includes at least one user device 102′ that is coupled to a document filing server 120′ over network 134.

Typically, the connection between network 134 and the Internet may be made via a firewall server (not shown). In some cases, there may be multiple links or firewalls, or both, between network 134 and the Internet. Some organizations may operate multiple networks 134 or virtual networks 134, which can be internetworked or isolated. These have been omitted for ease of illustration; however, it will be understood that the teachings herein can be applied to such systems. Network 134 may be constructed from one or more computer network technologies, such as IEEE 802.3 (Ethernet), IEEE 802.11 and similar technologies.

Computers and computing devices such as user device 102′ and server 120′ may be connected to network 134 or a portion thereof via suitable network interfaces. In some cases, the user device 102′ may connect to server 120′ using network 134 via the Internet. In other cases, the user device 102′ may be directly linked to server 120′, for example, via a Universal Serial Bus, Bluetooth™ or Ethernet connection.

The user device 102′ may be a computer such as a smartphone, desktop or laptop computer, which can connect to network 134 via a wired Ethernet connection or a wireless connection. The user device 102′ has a processor 104, a memory 106 that may include volatile memory and non-volatile storage, at least one communication interface 112, input devices 110 such as a keyboard and trackpad, output devices such as a display 108 and speakers, and various other input/output devices as will be appreciated. The user device 102′ may also include computing devices such as a smartphone or tablet computer. Only one user device 102′ is shown in FIG. 1B for ease of illustration; however, there may be a plurality of user devices 102′ in system 100′.

Processor 104 is a computer processor, such as a general purpose microprocessor. In some other cases, processor 104 may be a field programmable gate array, application specific integrated circuit, microcontroller, or other suitable computer processor.

Processor 104 is coupled to display 108, which is a suitable display for outputting information and data as needed by various computer programs. In particular, display 108 may display graphical user interfaces (GUI), such as the example user interfaces shown in FIGS. 5-7 discussed below. The user device 102′ may execute an operating system, such as Apple iOS™, Microsoft Windows™, GNU/Linux, or other suitable operating system.

Communication interface 112 is one or more data network interfaces, such as an IEEE 802.3 or IEEE 802.11 interface, for communication over a network.

Processor 104 is coupled, via a computer data bus, to memory 106. Memory 106 may include both volatile and non-volatile memory. Non-volatile memory stores computer programs consisting of computer-executable instructions, which may be loaded into volatile memory for execution by processor 104 as needed. It will be understood by those of skill in the art that references herein to user device 102′ as carrying out a function or acting in a particular way imply that processor 104 is executing instructions (e.g., a software program/application) stored in memory 106 and possibly transmitting or receiving inputs and outputs via one or more interfaces. Memory 106 may also store data input to, or output from, processor 104 in the course of executing the computer-executable instructions.

The computer processor, when executing the instructions, may receive inputs and transmit outputs to any of a variety of input or output devices to which it is coupled.

For instance, a document management application 114′ may be stored on the user device 102′. Although shown separately from memory 106, it will be understood that document management application 114′ may be stored in memory 106. In general, the document management application 114′ may provide a user of the user device 102′ with user interfaces for interacting with and managing storage of documents in document database 130. While document management application 114′ is shown as being provided on the user device 102′, the document management application 114′ may be provided as a cloud application (e.g., executed by processor 124 of server 120′) accessible to the user device 102′ over the Internet using network 134. The document management application 114′ may communicate with a document analysis application 132′ of server 120′, and with a master node 150 and client node 154, as described herein, to organize and manage documents in the document database 130. In some embodiments, the document management application 114′ may perform the functions of client node 154, while in other embodiments there may be distinct processes for each.

The server 120′ may be a computer such as a desktop or server computer, which can connect to network 134 via a wired Ethernet connection or a wireless connection. The server 120′ has a processor 124, a memory 126 that may include volatile memory and non-volatile storage, at least one communication interface 128, and a document database 130. The processor 124, memory 126, and communication interface 128 may be implemented in generally the same manner as with processor 104, memory 106, and communication interface 112 respectively.

Although shown as separate elements, it will be understood that database 130 may be stored in memory 126. Optionally, server 120′ may include additional input or output devices, although this is not required. As with all devices shown in system 100′, there may be multiple servers 120′, although not all are shown. In some cases, server 120′ may be distributed over a plurality of computing devices, for instance operating as a cloud server. As with user device 102′, references to acts or functions by server 120′ imply that processor 124 is executing computer-executable instructions (e.g., a software program) stored in memory 126.

As noted above, memory 126 may also store database 130. In some example embodiments, database 130 is a relational database. In other embodiments, database 130 may be a non-relational database, such as a key-value database, NoSQL database, a graph database, or the like. In some cases, database 130 may be formed from a mixture of relational and non-relational databases.

The user device 102′ and document filing server 120′ may have various additional components not shown in FIG. 1B. For example, additional input or output devices (e.g., keyboard, pointing device, etc.) may be included beyond those shown in FIG. 1B.

Data stored in the database 130 can be arranged into a file directory system with a plurality of file locations. The file directory system may include a plurality of folder levels, with high-level folders having one or more sub-folders that provide for more granular organization of files. Each file location in the plurality of file locations can be associated with a particular folder (and thus a particular folder level), and may also have secondary associations with each of the folders above that folder in a hierarchy. The folders and sub-folders may reflect categories and sub-categories used to organize documents. Although described as folder levels within a hierarchy, the files need not be stored in a hierarchical manner, and may instead merely have data attributes that may be indicative of a relative position in a logical hierarchy.

The server 120′ may store a software application referred to herein as a document analysis application 132′. Although shown separately from memory 126, it will be understood that document analysis application 132′ may be stored in memory 126. The document analysis application 132′ may be configured to analyze documents received by document filing server 120′ to determine suggested file locations in database 130. The document analysis application 132′ may also be configured to identify and separate distinct documents within received files. The document analysis application 132′ may also generate recommended file names for the document files. The document analysis application 132′ may communicate with a document management application 114′ of device 102′, and with a master node 150 and client node 154, as described herein, to organize and manage documents in the document database 130. In some embodiments, the document analysis application 132′ may perform the functions of master node 150, while in other embodiments there may be distinct processes for each.

While document analysis application 132′ and document management application 114′ are shown as separate applications, it will be understood that operations described as being performed by these applications may be performed by a single application operating on either the server 120′ or user device 102′, or such operations may be distributed between the user device 102′ and server 120′.

Similarly, while the master node 150 and client node 154 are shown as being provided by server 120′ and user device 102′ respectively, in some embodiments, this functionality may be further combined or subdivided among different devices. For example, in some embodiments, server 120′ may provide both master node 150 and client node 154.

The document analysis application 132′ may identify text data within received documents, for example using optical character recognition. Working with master node 150 and client node 154 as described herein, the text data may be processed to determine suggested or predicted file names, file locations, or both.

Computer vision and machine learning analysis can be applied to the text data prior to processing by master node 150 and client node 154. Page markers, including image characteristics and text data markers, may be used to identify one or more distinct documents in a received file and to split the pages in the received file into the distinct documents.

Referring now to FIG. 2A, shown therein is an example of a process 200A for generating suggested filing locations for documents in accordance with an example embodiment. Process 200A may be implemented using various computing systems, such as the automatic filing system 100 shown in FIG. 1A. Process 200A may be implemented to assist in a method for automatic ingestion and filing of documents in a database having a plurality of file locations.

At 210, a file can be received by the document analysis application 132. The received file will generally include at least one document. For example, a file may be dragged-and-dropped into an interface of the document management application 114. The file may also be uploaded to the document analysis application 132 from various user devices 102, e.g. as an attachment from an email application, as a scanned file, or as a file transferred from a digital camera or mobile device.

In some cases, the received file may include multiple documents. In such cases, a document splitting or separating method such as method 300 shown in FIG. 3 and described below may be used to automatically separate the distinct documents in the received file.

At 220, the document analysis application 132 may identify text data in the received file. The document analysis application 132 may identify the text data in each document in the received file(s).

In some cases, the text data may be automatically identifiable if the received file already includes electronic text, e.g. if the text data was created in a word processing or email application, or if text created by optical character recognition has been previously added to the received file. In other cases, for instance where the received file is a scanned file or digital camera image, the document analysis application 132 may perform further processing on the received file to identify the text data. The document analysis application 132 can perform an optical character recognition process on the document to identify the text data. For example, optical character recognition programs such as the open source tesseract-ocr may be used to identify the text data in a document.

In some cases, the document analysis application 132 may preprocess the received file before identifying the text data. For example, the document analysis application 132 may extract metadata from the received file. The extracted metadata may include metadata text data that may indicate potential keywords.

The document analysis application 132 may also preprocess the received file to identify image-based page markers for the pages in the received file. Image-based page markers are generally determined based on the visual characteristics of the page, such as image characteristics determined from analysis of the pages. Image-based page markers may include the location of page numbers or staple marks, for example. Page markers identified by document analysis application 132 may be used to identify distinct documents and group related pages, as is discussed in more detail below with reference to FIG. 3. In some cases, portions of the text data may also be identified as page markers.

At 230, the document analysis application 132 may index the text data to identify one or more document keywords. For example, indexing programs such as Apache Solr™ may be used to index the text data.

The document analysis application 132 may also index other page markers, such as the image-based page markers, determined from the received file. For instance, the location of page numbers or staple marks may be indexed to allow the document analysis application 132 to compare such page markers between different pages. In some cases, potential keywords may also be identified in the extracted metadata. The document analysis application 132 may index the potential metadata keywords along with the text data to identify the one or more document keywords.

At 240, the document keywords identified at 230 can be compared to a corpus of stored keywords. The corpus of stored keywords includes keywords already stored on the document filing server 120 that are used to determine potential filing locations for each received document. Each of the stored keywords has at least one file location association identifying a file location associated with that keyword. In some cases, a stored keyword may be associated with multiple file locations.

The corpus of stored keywords can be generated by the document analysis application 132 based on documents previously stored in the database. For instance, the corpus of stored keywords may be generated based on indexed text data from previously stored documents. Document keywords identified in previously stored documents may be included as stored keywords associated with the file location to which those previous documents were filed. Similarly, file names of previously stored documents may be used as stored keywords.

The document analysis application 132 can monitor the storage of documents to the document database 130 to update the corpus of stored keywords. This allows the document analysis application 132 to learn from and update the stored keywords to reflect how the user is choosing to store documents.
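A minimal sketch of such a corpus, updated each time a document is filed, is shown below. The class name `KeywordCorpus`, its methods, and the rule for extracting keywords from a file name are assumptions for this example; they do not describe any particular indexing program.

```python
import re
from collections import defaultdict


class KeywordCorpus:
    """Maps each stored keyword to the file locations associated with it."""

    def __init__(self):
        self._locations = defaultdict(set)

    def record_filing(self, location, file_name, document_keywords):
        """Update the corpus when a document is stored at a file location."""
        stem = file_name.rsplit(".", 1)[0].lower()     # drop the extension
        name_keywords = re.findall(r"[a-z0-9]+", stem)  # words in the file name
        for keyword in list(document_keywords) + name_keywords:
            self._locations[keyword.lower()].add(location)

    def locations_for(self, keyword):
        """Return a copy of the file locations associated with a keyword."""
        return set(self._locations.get(keyword.lower(), set()))


corpus = KeywordCorpus()
corpus.record_filing("Legal/Leases", "lease_2016.pdf", ["Landlord"])
corpus.record_filing("Finance/Taxes", "return_2016.pdf", ["tax"])
```

After these two filings, the stored keyword "2016" is associated with both file locations, while "landlord" is associated only with "Legal/Leases".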

The corpus of stored keywords can also be generated by the document analysis application 132 based on keywords determined from the file directory characteristics. The corpus of stored keywords may be generated based on file location names or folder names in the file directory, or from the contents of files in a directory. In some cases, users may also manually input keywords to the corpus of stored keywords. For example, a user may tag a file location or folder with one or more user-defined keywords that the user considers particularly relevant to that folder location. An example of a user interface that may be used to input user-defined keywords is shown in FIG. 7 and described below.

At 250, a plurality of keyword scores can be generated based on the comparison of the document keywords and the corpus of stored keywords at 240. The keyword scores reflect a relevance of the stored keywords to the received file, based on the document keywords identified in the document. For example, the keyword score may be determined based on a similarity measure between the document keyword and the stored keyword.

In some cases, the keyword scores can be generated based on a measure of importance of a document keyword to the received document. A keyword coefficient may be determined for each of the document keywords in the text data identified in the document. The keyword coefficient may indicate the measure of importance of that keyword to the document. The keyword coefficients can be used to generate the keyword score for the particular keyword. For example, the keyword score for a document keyword may be determined by modifying the similarity measure using the keyword coefficient (e.g. by addition or multiplication).
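The combination of a similarity measure and a keyword coefficient may be sketched as follows. The description does not specify a particular similarity measure; this example assumes the string ratio from Python's standard `difflib` module purely for illustration, and the function names and `mode` parameter are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(document_keyword, stored_keyword):
    """Illustrative similarity in [0, 1]; the choice of measure is an assumption."""
    return SequenceMatcher(
        None, document_keyword.lower(), stored_keyword.lower()
    ).ratio()


def keyword_score(document_keyword, stored_keyword, coefficient, mode="multiply"):
    """Modify the similarity measure using the keyword coefficient."""
    base = similarity(document_keyword, stored_keyword)
    if mode == "multiply":
        return base * coefficient
    if mode == "add":
        return base + coefficient
    raise ValueError("mode must be 'multiply' or 'add'")


# An exact match with a coefficient of 2.0 doubles the similarity of 1.0.
print(keyword_score("invoice", "invoice", 2.0))
```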

The keyword coefficient may be determined using the number of occurrences (occurrence level) of a keyword in the document. A word that appears more often in the document may be more important/relevant to the content of the document.

In some cases, the document keyword coefficient may be determined based on keyword text attributes for the document keyword. Keyword text attributes may include the text size and/or text location and/or text formatting of the keyword. For example, keywords with a larger text size may be given a higher keyword coefficient than keywords with a smaller text size.

The text location of a document keyword may also be used to determine the keyword coefficient. For example, document keywords identified at the top of the page, or centered in the page, may have a greater keyword coefficient than keywords in other parts of the document. The keyword coefficient may also be determined taking into account a keyword frequency of the keyword in the document.

The measure of importance of a document keyword to the received document, and in turn the keyword coefficient, may also be determined by identifying regions within a document that indicate the keywords are of greater importance. For example, keywords in a title region or header region may be determined to have greater keyword coefficients. In some cases, a header region or footer region may be determined based on expected page attributes such as the expected margins based on the page size.

The document analysis application 132 may also identify a document type of the received document. The document analysis application 132 may determine that the received document is a particular document type out of a plurality of known document types. The document analysis application 132 may identify a plurality of document regions based on the identified document type. Examples of document regions may include a title region, header region, identification region, date region, and the like. Each of the document regions may be associated with a regional importance measure for the document type. For example, the title region may have a higher regional importance measure than the date region. The keyword coefficient for each of the document keywords may be determined based on the regional importance measure of the document region for that document keyword in the document.
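One way the factors described above (number of occurrences, text size and regional importance) could be combined into a single keyword coefficient is sketched below. The specific formula, the `REGION_IMPORTANCE` values and the 12-point baseline text size are assumptions chosen for illustration; the description does not prescribe them.

```python
from dataclasses import dataclass

# Hypothetical regional importance measures for one document type.
REGION_IMPORTANCE = {"title": 3.0, "header": 2.0, "date": 0.5, "body": 1.0}


@dataclass
class KeywordOccurrence:
    region: str       # document region in which the keyword was found
    text_size: float  # text size of this occurrence, in points


def keyword_coefficient(occurrences, base_text_size=12.0):
    """Sum one contribution per occurrence, so frequent keywords score higher.

    Each contribution is scaled by the regional importance measure and by the
    text size relative to an assumed baseline size.
    """
    total = 0.0
    for occurrence in occurrences:
        regional = REGION_IMPORTANCE.get(occurrence.region, 1.0)
        total += regional * (occurrence.text_size / base_text_size)
    return total


occurrences = [
    KeywordOccurrence("title", 24.0),
    KeywordOccurrence("body", 12.0),
    KeywordOccurrence("body", 12.0),
]
print(keyword_coefficient(occurrences))  # 3.0 * 2.0 + 1.0 + 1.0 = 8.0
```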

Pre-existing document types, such as known legal documents and forms from different industries, may be input to the document analysis application 132 to provide the plurality of document types. The document analysis application 132 may use these pre-existing document types to identify a document type of the received file, based on a statistical similarity, layout similarity or other similarity measure.

At 260, a plurality of suggested file locations can be determined using the keyword scores and the file location associations for the corresponding stored keywords. One or more suggested file locations may then be displayed to the user in a user interface such as user interface 500 shown in FIG. 5, discussed below.

In some cases, the keyword scores determined at 250 may be weighted based on an importance of the stored keyword to a particular file location. Each stored keyword may have a location-specific weighting for each of the at least one file location associations for that stored keyword. The plurality of suggested file locations may then be determined by weighting the plurality of keyword scores using the location-specific weightings for each of the associated file locations.

The document database 130 may have a file directory that includes a plurality of folder levels. Each of the file locations in the plurality of file locations can be associated with a particular folder level. In some cases, the location-specific weighting for a particular file location association can be determined using the folder level of the file location corresponding to that file location association. For example, the location-specific weighting for a keyword score may be weighted more highly for file location associations at lower, i.e. more specific, folder levels, than for file location associations at higher, more general folder levels.
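The weighting of keyword scores by folder level may be sketched as follows. In this illustrative example the location-specific weighting is simply the depth of the folder path, so that more specific folders are weighted more highly; that choice, together with the function names and sample data, is an assumption rather than a prescribed formula.

```python
from collections import defaultdict


def folder_level(file_location):
    """Depth of a path such as '/Banking/Bank Accounts' (level 2)."""
    return len([part for part in file_location.split("/") if part])


def suggest_file_locations(keyword_scores, associations, top_n=3):
    """Rank file locations by the sum of their weighted keyword scores.

    keyword_scores: stored keyword -> keyword score
    associations:   stored keyword -> list of associated file locations
    """
    totals = defaultdict(float)
    for keyword, score in keyword_scores.items():
        for location in associations.get(keyword, []):
            # Assumed weighting: deeper (more specific) folders count for more.
            totals[location] += score * folder_level(location)
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


scores = {"bank": 1.0, "savings": 0.5}
links = {
    "bank": ["/Banking", "/Banking/Bank Accounts"],
    "savings": ["/Banking/Bank Accounts"],
}
print(suggest_file_locations(scores, links))
```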

In some cases, the location-specific weightings may include user-defined weightings set by a user of user device 102. For example, where a user inputs user-defined keywords to be associated with a file location, the user may then set a weighting to indicate relative importance of the keywords to the particular file location. An example user interface 600 is shown in FIG. 6 that may allow a user to define weightings for stored keywords.

Displaying a plurality of suggested filing locations may facilitate a user's bookkeeping and save the user time. It may also provide educational value by indicating to the user file locations containing similar documents. This may show the user that various folders include documents that are similar or related, which the user may not otherwise expect.

Because the document analysis application 132 generates suggested filing locations, users may add documents to the server 120 without being required to immediately determine a filing location. Users may be prompted when the documents are uploaded with suggested filing locations as shown in FIG. 5. Users may defer selection of a file location, and the document analysis application 132 may store documents temporarily in an unfiled documents folder. In some cases, the document analysis application 132 may tentatively file the documents in the top suggested file location. In some cases, the users may be periodically prompted with suggested filing locations for their unfiled or tentatively filed documents, spreading the task out to when users have idle time to improve engagement for document filing.

In some embodiments, the document analysis application 132 may also determine a recommended file name for the document based on the document keywords identified in the document. The document analysis application 132 may determine keyword coefficients for each of the document keywords in the text data when determining a recommended file name. The keyword coefficients may indicate the importance of the corresponding keyword to the document. The recommended file name may then be identified using the keyword coefficients of the document keywords. The document analysis application 132 may determine the recommended file name using the document keywords determined to be most important or relevant to the document.
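A minimal sketch of deriving a recommended file name from keyword coefficients is shown below. The function name, the underscore separator, the default extension and the limit on the number of keywords are hypothetical choices made only for illustration.

```python
def recommend_file_name(keyword_coefficients, max_keywords=3, extension=".pdf"):
    """Build a file name from the document keywords with the highest coefficients.

    keyword_coefficients: document keyword -> keyword coefficient
    """
    # Highest coefficient first; ties are broken alphabetically for stability.
    ranked = sorted(keyword_coefficients.items(), key=lambda item: (-item[1], item[0]))
    chosen = [keyword for keyword, _ in ranked[:max_keywords]]
    return "_".join(chosen) + extension


coefficients = {"statement": 5.0, "acme": 8.0, "page": 0.5}
print(recommend_file_name(coefficients, max_keywords=2))  # acme_statement.pdf
```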

As mentioned above, the keyword coefficients may be determined using keyword text attributes, such as text size, text format and text location. For example, document keywords identified in a title document region, or document keywords with larger text size, may be used to generate a recommended file name. Keyword coefficients may also be determined using the number of occurrences of a keyword in the document.

In some cases, the document analysis application 132 may compare the one or more document keywords to stored file names associated with the plurality of suggested file locations. The recommended file name may be determined based on the file names of documents previously stored to the same or similar folders. For example, a correspondence or naming convention may be identified between document keywords of previously stored files and the file names of those previously stored files. The file name for the new file may then be recommended using the identified naming convention for the document keywords of the new document.

Generating recommended file names for the files to be stored may allow for quicker filing for documents that are incorrectly named, or have generic/non-descriptive names. This may be particularly useful when multiple documents are uploaded with generic names, such as scanned documents or images from digital cameras. Furthermore, generated recommended file names may be helpful when multi-document files are split into multiple distinct files, for example using methods 300 or 400 shown in FIGS. 3 and 4 respectively and described below.

In some cases, the document analysis application 132 may adjust the location-specific weightings for various keywords by ongoing monitoring of stored documents. The document analysis application 132 may monitor the document keywords and file locations of documents as they are stored to update the location-specific weighting for various keywords. Similarly, the document analysis application 132 may monitor the file names given to stored documents to modify the correspondence between document keywords and recommended file names.

Referring now to FIG. 2B, shown therein is an example of a process 200B for generating suggested filing locations for documents in accordance with an example embodiment. Process 200B may be implemented using various computing systems, such as the automatic filing system 100′ shown in FIG. 1B. Process 200B may be implemented to assist in a method for automatic ingestion and filing of documents in a database having a plurality of file locations.

At 210, a file can be received by the document analysis application 132′. The received file will generally include at least one document. For example, a file may be dragged-and-dropped into an interface of the document management application 114′. The file may also be uploaded to the document analysis application 132′ from various user devices 102′, e.g. as an attachment from an email application, as a scanned file, or by transfer from a digital camera or mobile device.

In some cases, the received file may include multiple documents. In such cases, a document splitting or separating method such as method 300 shown in FIG. 3 and described below may be used to automatically separate the distinct documents in the received file.

At 220, the document analysis application 132′ may identify text data in the received file. The document analysis application 132′ may identify the text data in each document in the received file(s).

In some cases, the text data may be automatically identifiable if the received file already includes electronic text, e.g. if the text data was created in a word processing or email application, or if text created by optical character recognition has been previously added to the received file. In other cases, for instance where the received file is a scanned file or digital camera image, the document analysis application 132′ may perform further processing on the received file to identify the text data. The document analysis application 132′ can perform an optical character recognition process on the document to identify the text data. For example, optical character recognition programs such as the open source tesseract-ocr may be used to identify the text data in a document.

In some cases, the document analysis application 132′ may preprocess the received file before identifying the text data. For example, the document analysis application 132′ may extract metadata from the received file. The extracted metadata may include metadata text data that may indicate potential keywords.

The document analysis application 132′ may also preprocess the received file to identify image-based page markers for the pages in the received file. Image-based page markers are generally determined based on the visual characteristics of the page, such as image characteristics determined from analysis of the pages. Image-based page markers may include the location of page numbers or staple marks for example. Page markers identified by document analysis application 132′ may be used to identify distinct documents and group related pages, as is discussed in more detail below with reference to FIG. 3. In some cases, portions of the text data may also be identified as page markers.

At 272, the document analysis application 132′ may communicate with a master node, such as master node 150 of system 100′, to process the text data. The master node will be pre-trained using relevant data to have one or more feature vectors that may be used to process the text data and to generate first or master prediction data for the text data. In addition, the master node may be continuously trained using a feedback process from a plurality of different client nodes (which may correspond to different users), as described herein, for example with reference to FIG. 8. In this way, system 100′ can generate prediction data (and the plurality of suggested file locations) based on the content of a user's documents as uploaded by the user, and also based on the aggregate of all user documents that have been uploaded to the system and processed by the master node. Thus, system 100′ can automatically generate associations between documents to suggest locations by creating relations between similar words in other documents. The locations of documents that match highly can be used to return weighted filing location predictions.

Master prediction data may be, for example, one or more predicted file names, predicted file locations, predicted attributes or metadata, etc. For example, if a source document is a monthly bank statement, the master prediction data may have a bank name, a bank account number, a date of the statement, a predicted file name, a predicted file location (e.g., at a first and second level in a hierarchy), and so on. If multiple predictions are made in a particular category (e.g., multiple file locations are predicted), the predictions may be ranked according to the confidence level of the prediction.

In some cases, if the master node is unable to generate a prediction, or unable to generate a sufficient number of predictions beyond a threshold, the system may revert to a keyword-based approach as described with reference to FIG. 2A.
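The ranking of predictions by confidence and the reversion to the keyword-based approach may be sketched as follows. The two predictor callables, the `(file_location, confidence)` tuple shape and the default threshold values are hypothetical stand-ins for the master node and for the keyword-based process of FIG. 2A.

```python
def suggest_with_fallback(text_data, master_predict, keyword_predict,
                          min_confidence=0.5, min_predictions=1):
    """Use master node predictions, reverting to keywords when too few qualify.

    Both predictors take the text data and return a list of
    (file_location, confidence) tuples.
    """
    predictions = [
        prediction for prediction in master_predict(text_data)
        if prediction[1] >= min_confidence
    ]
    if len(predictions) < min_predictions:
        # Not enough predictions beyond the threshold: keyword-based approach.
        predictions = keyword_predict(text_data)
    # Rank according to the confidence level of each prediction.
    return sorted(predictions, key=lambda prediction: prediction[1], reverse=True)


def confident_master(text):
    return [("/Utilities", 0.6), ("/Banking/Bank Accounts", 0.9)]


def unsure_master(text):
    return [("/Utilities", 0.2)]


def keyword_based(text):
    return [("/Unsorted", 0.4)]


print(suggest_with_fallback("monthly statement", confident_master, keyword_based))
print(suggest_with_fallback("monthly statement", unsure_master, keyword_based))
```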

At 274, the master prediction data and the text data may be transmitted to a client node, such as client node 154 of system 100′, for further processing.

At 276, the client node may process the text data and the master prediction data to refine the predictions and thereby generate refined prediction data. In addition to the master prediction data, the refined prediction data may have additional file location predictions (e.g., third level and higher level file location predictions) and more detailed file name predictions that correspond to a particular user's tastes.

Because the system generates suggested or predicted filing locations, in some cases documents may be filed immediately at 280 upon receiving the refined prediction data, by selecting the highest ranked predictions.

In some other cases, the refined prediction data may be communicated to document management application 114′ for presentation to the user, and user input may be received at 278 either confirming the highest ranked predictions, or else indicating which of the predictions were selected (or if the user overrode the predictions). The documents may then be filed at 280.

Referring now to FIG. 3, shown therein is an example of a process 300 for automatically splitting received documents into a plurality of distinct documents in accordance with an example embodiment. Process 300 may be implemented in some embodiments of the process 200A or 200B for generating suggested filing locations using a computing system such as system 100 of FIG. 1A or system 100′ of FIG. 1B.

At 310, the at least one document may be received as a single file. The document analysis application 132 or 132′ may receive the file(s) in the same manner as described above (e.g., at 210 of FIG. 2A or FIG. 2B).

At 320, the document analysis application 132 can identify a plurality of pages in the received file. For example, the plurality of pages may be identified based on metadata or image characteristics extracted from the received file.

At 330, a plurality of page markers can be determined for each page. The page markers may include image-based page markers and/or text-based page markers.

Image-based page markers generally include visual or image-based characteristics that can be identified in the received document. That is, the image-based page markers generally reflect the visual appearance or look of the page. Examples of image-based page markers include page layout, page orientation/angle, angle of text, page number position, artifacts such as staple marks, page background characteristics, color characteristics, average density, grey scales and other image characteristics derived from the page.

Text-based page markers refer to page markers derived from the text data in the document. For instance, text-based page markers may include corresponding titles, corresponding headers/footers, sequential page numbering, and punctuation/grammar page markers (e.g. a lack of punctuation at the end of a page, no capital letters at the beginning of the next page).

The page markers may also include metadata page markers. Metadata page markers can be identified in metadata extracted from a received file.

At 340, the document analysis application may determine that the at least one document comprises a plurality of distinct documents. At 350, each page can be assigned to one of the distinct documents by grouping the plurality of pages into the distinct documents by comparing the page markers for the plurality of pages.

In general, steps 340 and 350 may occur concurrently or effectively simultaneously. The page markers determined at 330 may be used to cluster or classify pages into groups, for example using Bayesian classifiers. As a result of this classification process, the document analysis application 132 may determine both that there are a plurality of distinct documents and the assignment of pages to those distinct documents.
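The grouping of pages by comparing page markers may be illustrated with the simplified sketch below. It does not implement a Bayesian classifier; instead it substitutes a plain comparison of consecutive pages, starting a new document whenever the fraction of matching page markers falls below a threshold. The marker names, the similarity formula and the threshold value are assumptions made for illustration.

```python
def marker_similarity(markers_a, markers_b):
    """Fraction of page markers that are present on both pages with equal values."""
    keys = set(markers_a) | set(markers_b)
    if not keys:
        return 1.0
    matches = sum(
        1 for key in keys
        if key in markers_a and key in markers_b and markers_a[key] == markers_b[key]
    )
    return matches / len(keys)


def group_pages(pages, threshold=0.5):
    """Assign consecutive pages to distinct documents; returns lists of page indexes."""
    documents = []
    for index, markers in enumerate(pages):
        if documents and marker_similarity(pages[index - 1], markers) >= threshold:
            documents[-1].append(index)  # same document as the previous page
        else:
            documents.append([index])    # page markers differ: new document
    return documents


pages = [
    {"header": "ACME Bank", "orientation": "portrait"},
    {"header": "ACME Bank", "orientation": "portrait"},
    {"header": "City Utilities", "orientation": "landscape"},
]
print(group_pages(pages))  # [[0, 1], [2]]
```

As in the process described above, this sketch determines in a single pass both that the file contains more than one distinct document and which pages belong to each.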

Once a plurality of distinct documents are identified, and the pages are assigned to the distinct documents, separate electronic files can be generated for each document. The document analysis application 132 may then determine suggested filing locations for each of the distinct documents, using embodiments of method 200A or 200B described above. Furthermore, as the newly created electronic documents may be initially unnamed (or have generic temporary file names), the document analysis application 132 may use the processes described herein to generate recommended file names to facilitate the creation and storage of such documents.

Referring now to FIG. 4, shown therein is an embodiment of a flowchart 400 showing an example process for automatic ingestion and splitting of electronic documents that may be used by system 100A or system 100B. The process shown in flowchart 400 provides a specific example of how process 300 described above may be implemented when a PDF document store is uploaded to the document filing server 120.

The process 400 begins at 402 with a user or client sending or uploading an electronic file to the document server 120 or 120′. In the example shown in FIG. 4, the electronic file is a PDF document store.

Once the PDF document store is received, the document analysis application 132 or 132′ can extract metadata from the received file at 406. The document analysis application 132 or 132′ may also separate the PDF document into individual PDF pages using a burst operation. The individual PDF pages may then be parsed using a computer vision application such as OpenCV to identify image characteristics in each of the pages at 410. The computer vision application may identify artifacts or page characteristics which may subsequently be used to identify pages corresponding to the same document, for example using Hough transforms. One example of such an artifact may be staple marks. Other image characteristics may include page orientation, text angle, color, density and so forth.

The image characteristics may then be used to pre-process the received pages at 414. For example, image processing applications such as ImageMagick® may be used to pre-process the received pages. Once the pages have been pre-processed, text data may be identified in the pages at 418. Where the received pages do not already have identifiable text data, optical character recognition may be performed using applications such as Tesseract-ocr.

Once identified, the text data may be used to build a feature set, or feature vectors, at 426.

In some embodiments, the text data may then be indexed to identify document keywords. The text data may be indexed using indexing applications such as Apache Solr. The image characteristics identified in the received pages may similarly be indexed.

The indexed data for each page can be used to generate feature vectors for that page. These feature vectors may then be used to generate a page characteristic index using an application such as Apache Lucene™ and/or Elasticsearch™. The page characteristic indexes for each page can then be classified, e.g. using Bayesian classifiers in Apache Mahout™, to identify pages corresponding to the same distinct documents at 430. The corresponding pages may then be merged into distinct document files based on the classification.

In some other embodiments, indexing of text data to identify document keywords may be omitted, and the raw text data may be input directly to the nodes of the machine learning system to generate feature vectors for the raw text data and to perform the classification at 430.

Referring now to FIG. 5, shown therein is a screenshot of a user interface 500 that may be displayed to a user. User interface 500 is an example of a file location selection user interface that may display one or more suggested file locations to the user of the user device 102 or 102′. The user may be able to rapidly select a file location from user interface 500 to file a document. The user may also choose to select an alternative file location.

In some cases, a user may choose to defer selecting a file location for files added to the document database 130. Such documents may be stored in an unfiled documents folder, or tentatively stored in the first suggested filing location, until the user selects a file location to store the documents. In such cases, the user may be periodically prompted using a user interface such as user interface 500 to select file locations for unfiled documents. Alternatively, the user may navigate to the unfiled documents folder and select a document for filing. At that time, user interface 500 may be displayed to the user.

Referring now to FIG. 6, shown therein is a screenshot of a user interface 600 that may be displayed to a user. A user of user device 102 may interact with user interface 600 to adjust the location-specific weightings for various folders and folder levels, and in turn for various keywords.

In the example user interface 600 shown in FIG. 6, the user may adjust the folder level weight to be applied to folders at a particular folder level. In the example shown, the folder-level location-specific weight for each folder of folder level 1 (i.e. a higher level, likely more generic folder) is set to 5 on a scale of 1-100. A lower level folder may then be given a higher weight, for example 10 or 15. In some cases, these folder weights may be set or adjusted automatically by document analysis application 132. For instance, as additional folder levels are added, the folder weights may be adjusted to account for the different levels of granularity in the folder levels.

The user may also adjust a weighting threshold value that adjusts how much a folder's particular folder level is used to weight the keyword scores generated for particular keywords. In the example user interface 600, a higher threshold value places a greater value on the weight assigned to a particular folder level. In the example shown, if the weighting threshold value is set to 0, the keyword scores for a particular keyword are not adjusted based on the particular folder level for the file locations associated therewith. However, the keyword scores may still be weighted based on location-specific weightings for specific keywords, such as user-defined location-specific weightings (indicating the relative importance/relevance of the stored keyword to that file location).
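One possible reading of the weighting threshold value is sketched below. The description does not give a formula; the linear form used here is an assumption, chosen only because it leaves keyword scores unchanged when the threshold is 0 and gives the folder level weight more influence as the threshold increases. The function and parameter names are hypothetical.

```python
def apply_folder_level_weight(keyword_score, folder_level_weight, weighting_threshold):
    """Scale a keyword score by the folder level weight (on the 1-100 scale).

    Assumed formula: a threshold of 0 leaves the score unchanged, and larger
    thresholds place more value on the folder level weight.
    """
    return keyword_score * (1.0 + weighting_threshold * folder_level_weight / 100.0)


# Level 1 folder weight of 5 versus a lower level folder weight of 15.
print(apply_folder_level_weight(2.0, 5, 0.0))
print(apply_folder_level_weight(2.0, 5, 1.0))
print(apply_folder_level_weight(2.0, 15, 1.0))
```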

Referring now to FIG. 7, shown therein is a screenshot of a user interface 700 that may be displayed to a user. User interface 700 is an example of a keyword definition user interface that may be used by a user of the user device 102 to define keywords for one or more file locations. As shown in user interface 700, the document database may include a plurality of categories/folders with different folder levels. The folder levels may indicate whether a particular folder/category is a top level folder/category (i.e. level 1) or whether it is a sub-folder/sub-category (e.g. level 2, level 3 and so on). One or more user-defined keywords may be associated with the various categories/folders.

User-defined keywords may facilitate the determination of suggested filing locations when a user is beginning to use the document database (i.e. there are few or no previously stored documents). As additional documents are added to the document database 130, the weight or importance given to the user-defined keywords may be modified as the document analysis application 132 learns from the storage of previous documents. Similarly, a user may adjust the weight to be given to a particular keyword using an interface such as user interface 600.

Referring now to FIG. 8, shown therein is an example feedback flow in accordance with some embodiments for feedback-reinforced learning for the machine learning nodes of system 100B. The process shown in flowchart 800 provides a specific example of how the nodes may be provided with feedback to reinforce certain learning and to update feature vectors to provide improved predictions.

Flow 800 begins with the input of documents 810a to 810c at respective user devices 102′. For example, document 810a may be input via a document management application 114′ stored on a first user device. Document 810b may be input via a document management application 114′ stored on a second user device, and document 810c may be input via a document management application 114′ stored on a third user device.

Each of the first, second and third devices may be instances of user device 102′ of system 100B; therefore each of the first, second and third devices may have respective client nodes 820a, 820b and 820c, which are instances of client node 154 of system 100B.

Each document management application will process the document 810 as described herein, and may generate learning data such as the user identification of the user of the document management application, the text data of the document 810, and the file name and file location chosen by the user when filing the document 810. This client node learning data can then be provided to the respective client node, which can then update its local feature vectors using the output. In some embodiments, the client node learning data may be provided as a JSON object. For example, client node learning data from document 810a may be input to the client node 820a of the first user device, client node learning data from document 810b may be input to the client node 820b of the second user device, and client node learning data from document 810c may be input to the client node 820c of the third user device.

In this way, each client node can act as a localized artificial intelligence that learns user-specific file naming and filing conventions. That is, client node learning data from document 810b, which is filed by a user of the second user device, will not affect client node 820a, and vice versa.

However, more generalized learning data can still be used to improve the performance of the master node, such as a master node 150 of system 100B. In particular, the master node can be provided with a subset of the learning data from each client node 820a, 820b and 820c, which can then be used for learning by the master node.

In some embodiments, each client node 820a to 820c can output a learning data subset, or master node learning data, to the master node. In some embodiments, the master node learning data may be a JSON object, and may be pseudo-anonymized by removing the user identification. The master node learning data can also be generalized by removing file location data higher than the second level. That is, file location data can be limited only to the first or second level, such as “/Banking/Bank Accounts”. Higher level file locations, such as the “MySavings” in “Banking/Bank Accounts/MySavings”, can be stripped. The text data of the source document may still be provided, however, since it serves as the basis for the learning.
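The derivation of master node learning data from client node learning data may be sketched as follows. The JSON field names (`user_id`, `text_data`, `file_name`, `file_location`) and the sample values are hypothetical; the description states only that the user identification is removed and that file location data above the second level is stripped.

```python
import json


def to_master_learning_data(client_learning_data, max_levels=2):
    """Pseudo-anonymize and generalize client node learning data."""
    # Copy every field except the user identification.
    master_data = {
        key: value for key, value in client_learning_data.items()
        if key != "user_id"
    }
    # Keep only the first `max_levels` folder levels of the file location.
    parts = [part for part in client_learning_data["file_location"].split("/") if part]
    master_data["file_location"] = "/" + "/".join(parts[:max_levels])
    return master_data


client_learning_data = {
    "user_id": "user-123",
    "text_data": "ACME Bank monthly statement",
    "file_name": "statement.pdf",
    "file_location": "/Banking/Bank Accounts/MySavings",
}
master_learning_data = to_master_learning_data(client_learning_data)
print(json.dumps(master_learning_data, indent=2))
```

In this sketch the text data is passed through unchanged, since it serves as the basis for learning at the master node, while the user-specific "MySavings" folder is removed.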

In some embodiments, master node 830 may also provide learning data to client nodes 820a to 820c. This may occur, for example, when providing suggested file locations, which can then be used by each client node to further refine its respective predictions.

In some embodiments, a document management application may directly provide master node learning data to the master node, rather than via the client node.

The present invention has been described here by way of example only. Numerous specific details are set forth herein in order to provide a thorough understanding of the exemplary embodiments described herein. However, it will be understood by those of ordinary skill in the art that these embodiments may, in some cases, be practiced without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to obscure the description of the embodiments. Various modifications and variations may be made to these exemplary embodiments without departing from the spirit and scope of the invention, which is limited only by the appended claims.

We claim:
 1. A method for automatic ingestion and filing of documents in a database having a plurality of file locations, the method comprising: receiving an electronic file including at least one document; for each document in the at least one document identifying text data in the document; and generating a plurality of suggested file locations for each respective document.
 2. The method of claim 1, wherein generating the plurality of suggested file locations comprises: processing the text data at a master node to generate a plurality of suggested file locations, wherein the master node is a machine learning node common to a plurality of users; processing the text data at a client node to refine the plurality of suggested file locations for one of the plurality of users, wherein the client node is a machine learning node specific to the one of the plurality of users.
 3. The method of claim 2, wherein the plurality of suggested file locations generated at the master node comprises first or second level file locations in a hierarchy.
 4. The method of claim 2, wherein the plurality of suggested file locations generated at the master node comprises third or higher level file locations in a hierarchy.
 5. The method of claim 1, wherein generating the plurality of suggested file locations comprises: comparing the one or more document keywords to a corpus of stored keywords, the corpus of stored keywords previously generated based on a plurality of documents in the database, wherein each of the stored keywords in the corpus has at least one file location association identifying a file location associated therewith; and generating a plurality of keyword scores based on the comparison of the one or more document keywords and the corpus.
 6. The method of claim 1, furthercomprising identifying a plurality of pages in the file; determining aplurality of page markers for each page; determining that the at leastone document in the file comprises a plurality of distinct documents;and assigning each page to one of the distinct documents by grouping theplurality of pages into the distinct documents by comparing the pagemarkers for the plurality of pages.
 7. The method of claim 6, whereinthe page markers comprise image-based page markers derived from a visualappearance of the page.
 8. The method of claim 6, wherein the pagemarkers comprise text-based page markers derived from the text data inthe document.
 9. The method of claim 1, further comprising: for each stored keyword in the corpus of stored keywords, determining a location-specific weighting for each file location association; and generating the plurality of suggested file locations by weighting the plurality of keyword scores using the location-specific weightings.
 10. The method of claim 9, wherein: the database is arranged into a file directory having a plurality of folder levels with each file location in the plurality of file locations associated with a particular folder level, and the location-specific weighting for each file location association is determined using the folder level of the file location corresponding to that file location association.
 11. The method of claim 1, further comprising, for each document in the at least one document: determining a keyword coefficient for each of the document keywords in the text data, each keyword coefficient indicating a measure of importance of the corresponding document keyword to the document; and generating the plurality of keyword scores using the keyword coefficient.
 12. The method of claim 11, wherein the measure of importance of the corresponding document keyword to the document is determined by: identifying keyword text attributes for the document keyword, the keyword text attributes including at least one of a text size, a text location and a text format; and determining the keyword coefficient for the document keyword in the text data based on the keyword text attributes.
 13. The method of claim 1, wherein identifying the text data comprises performing optical character recognition on the document to identify the text data.
 14. The method of claim 1, further comprising, determining a recommended file name for one of the received documents by: determining a keyword coefficient for each of the document keywords in the text data, each keyword coefficient indicating a measure of importance of the corresponding document keyword to the document; and determining the recommended file name using the keyword coefficients of the document keywords.
 15. A computer program product for automatic ingestion and filing of documents in a database, the computer program product comprising a non-transitory computer readable storage medium and computer-executable instructions stored on the computer readable storage medium, the instructions for configuring a processor to: receive an electronic file including at least one document; for each document in the at least one document identify text data in the document; and generate a plurality of suggested file locations for each respective document.